Third Lab

Today’s goals. Learn about:

  1. Interval estimation
    • Interval estimation for \(\mu\) under the Gaussian case, when \(\sigma^2\) is known.
    • Interval estimation for \(\mu\) under the Gaussian case, when \(\sigma^2\) is unknown.
    • Comparing two population means or proportions.

Workflow

  1. Consider examples related to theoretical concepts
  2. Using R to solve some basic questions and to understand important aspects

Let’s start

  1. Download the R code (it is briefly commented).
  2. You will find XXX markers where I will ask you to code a bit.
  3. The code after each XXX depends on successfully filling in these lines of code.

Interval estimation

Interval estimation

Definition of confidence interval

  • The point estimate is in most cases a crude summary: a single value. Alternatively, we may construct an interval for our parameter.

  • Recall that the confidence interval \(\hat \Theta(y)\) (or region in the case of dimension of the parameter greater than \(1\)) for \(\theta\) with confidence level \(1-\alpha\) is a subset of the parameter space \(\Theta\), defined as function of the data \(y\) and such that \[Pr_\theta(\theta \in \hat \Theta(Y))= 1-\alpha \]

Note

  • The l.h.s. of the previous formula is called the coverage probability;

  • A confidence interval can be obtained from the acceptance regions of tests of level \(\alpha\) for \(H_0: \theta = \theta_0\) (we will see later the connection between hypothesis testing and confidence intervals).

  • Interval estimation is based on pivotal quantities (functions of the data and the parameter whose distribution is known and does not depend on the parameter).
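As a quick numerical illustration (simulated data; the sample size and parameter values below are arbitrary choices, not taken from the lab code), the distribution of a pivotal quantity is the same whatever the true parameter values:

```r
set.seed(3)
n <- 10
# T(mu) = (mean(x) - mu) / (sigma / sqrt(n)) is N(0, 1) for any mu, sigma
pivot <- function(mu, sigma) {
  x <- rnorm(n, mu, sigma)
  (mean(x) - mu) / (sigma / sqrt(n))
}
t1 <- replicate(5000, pivot(mu = 0,  sigma = 1))
t2 <- replicate(5000, pivot(mu = 50, sigma = 9))
c(mean(t1), sd(t1))  # close to 0 and 1
c(mean(t2), sd(t2))  # close to 0 and 1
```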

Interval estimation for \(\mu\) under the Gaussian case, when \(\sigma^2\) is known

Confidence Interval for \(\mu\) when \(\sigma^2\) is known

Let \(x=(x_1, \ldots, x_n)\) be the realizations of the r.v. \(X = (X_1, \ldots, X_n)\) where the \(X_i\), \(i=1, \ldots, n\), are independent and identically distributed according to \(\mathcal{N}(\mu, \sigma^2)\). To obtain a confidence interval for the mean \(\mu\) when \(\sigma^2\) is known, we leverage the pivotal quantity: \[ T(\mu)=\frac{\overline X - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1), \quad \forall \mu \in \mathbb{R}, \sigma^2 > 0.\]

Then, it follows that for \(0<\alpha <1\) \[\mathrm{Pr}(-z_{1-\alpha/2} \leq T(\mu) \leq z_{1-\alpha/2})=1-\alpha \] and a confidence interval of level \(1-\alpha\) for \(\mu\) is given by \[\text{CI}^{(1-\alpha)}_\mu=\bigg(\bar x - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}, \bar x + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}\bigg)\] where \(z_{\alpha/2}=-z_{1-\alpha/2}\) and \(z_{1-\alpha/2}\) are the quantiles of order \(\alpha/2\) and \(1-\alpha/2\) of the standard normal distribution, respectively.
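For a single simulated sample, the interval is computed directly in R (the values of n, mu, and sigma below are illustrative, not from the lab data):

```r
set.seed(1)
n <- 10; mu <- 5; sigma <- 2            # illustrative values
x <- rnorm(n, mu, sigma)                # one sample
z <- qnorm(1 - 0.05/2)                  # quantile of order 0.975 of N(0, 1)
ci <- mean(x) + c(-1, 1) * z * sigma/sqrt(n)  # 95% CI for mu (sigma known)
ci
```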

Interval estimation for \(\mu\) under the Gaussian case, when \(\sigma^2\) is known

Monte Carlo simulations

Also for interval estimation, Monte Carlo simulation is well suited to explore the procedure:

B <- 1000     # Number of replications
n <- 10       # sample size
mu <- 5       # True population mean 
sigma <- 2    # True population standard deviation

alpha <- 0.05 # Confidence level: 1- alpha

# CI is a matrix in which we save the confidence interval from each replication:
# -) first column: lower bound
# -) second column: upper bound
CI <- matrix(0, B, 2)

# l is a vector whose elements are TRUE (1) or FALSE (0) depending on
# whether the true parameter value lies within the interval
l <- rep(0, B)

Time to work with R: It’s your turn

  • Generate samples from the \(\mathcal{N}(\mu, \sigma^2)\).
  • Obtain the confidence interval for each replication.
  • Use a flag to detect if the confidence interval includes the true parameter value.

Interval estimation for \(\mu\) under the Gaussian case, when \(\sigma^2\) is known

Monte Carlo simulations

set.seed(1234)
q0975 <- qnorm(1 - alpha/2) # quantile of order 1-alpha/2 of N(0,1)
for(i in 1 : B) {
 x <- rnorm(n, mu, sigma)
 CI[i,] <- mean(x) + c(-q0975, q0975) * sigma/sqrt(n)
 l[i] <-  (mu > CI[i,1] & mu < CI[i,2]) 
}
mean(l)
[1] 0.955

Interval estimation for \(\mu\) under the Gaussian case, when \(\sigma^2\) is known

# Plot the first 100 CIs:
# black: intervals not including mu
# red: intervals including mu
plot(1, xlim = c(0, 10), ylim = c(0, 11), type = "n", 
     xlab = expression(mu), ylab = "", yaxt = "n", 
     main = "100 CIs for the mean (known variance)", cex.main = 1.2)
abline(v = mu)

d <- 0
for(i in 1 : 100){
  d <- d + 0.1 
  lines(seq(CI[i, 1], CI[i, 2], length = 100), rep(d, 100), col = (l[i] + 1))
}
# number of intervals (out of the 100) including the true parameter value 
print(sum(l[1 : 100]))

Interval estimation for \(\mu\) under the Gaussian case, when \(\sigma^2\) is unknown

Confidence Interval for \(\mu\) when \(\sigma^2\) is unknown

Let \(x=(x_1, \ldots, x_n)\) be the realization of the r.v. \(X = (X_1, \ldots, X_n)\) where the \(X_i\), \(i=1, \ldots, n\), are independent and identically distributed according to \(\mathcal{N}(\mu, \sigma^2)\). To obtain a confidence interval for the mean \(\mu\) when \(\sigma^2\) is unknown and is estimated by the sample variance estimator \(S^2\), we leverage the pivotal quantity: \[ T(\mu)=\frac{\bar X - \mu}{S/\sqrt{n}} \sim t_{n-1}, \quad \forall \mu \in \mathbb{R}, \sigma^2>0 \]

Then, it follows that for \(0<\alpha <1\) \[\mathrm{Pr}(t_{n-1;\alpha/2} \leq T(\mu) \leq t_{n-1;1-\alpha/2})=1-\alpha \] and a confidence interval of level \(1-\alpha\) for \(\mu\) is given by \[\text{CI}^{(1-\alpha)}_\mu=\bigg(\bar x - t_{n-1; 1-\alpha/2} \frac{s}{\sqrt{n}}, \bar x + t_{n-1;1-\alpha/2} \frac{s}{\sqrt{n}}\bigg)\] where \(t_{n-1;\alpha/2}=-t_{n-1;1-\alpha/2}\) and \(t_{n-1;1-\alpha/2}\) are the quantiles of order \(\alpha/2\) and \(1-\alpha/2\) of a Student's t distribution with \(n-1\) degrees of freedom, respectively.
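As a sketch on one simulated sample (the parameter values are illustrative), the manual interval should coincide with the one returned by t.test:

```r
set.seed(1)
n <- 10
x <- rnorm(n, 5, 2)                      # illustrative sample
tq <- qt(1 - 0.05/2, df = n - 1)         # quantile of order 0.975 of t(n-1)
ci <- mean(x) + c(-1, 1) * tq * sd(x)/sqrt(n)
ci
t.test(x, conf.level = 0.95)$conf.int    # same interval
```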

Confidence interval for \(\mu\) when \(\sigma^2\) is unknown

# consider the same setting as above: 
# mu = 5, sigma=2, n=10, B=1000, alpha=0.05
CI <- matrix(0, B, 2)
l <- rep(0, B)

set.seed(1234)
q0975 <- qt(1 - alpha/2, n - 1) # quantile of order 1-alpha/2  of t(n-1)

# -) Generate samples from the N(mu, sigma^2)
# -) Obtain the confidence interval for each replication 
# -) Use a flag to detect if the confidence interval 
#    includes the true parameter value
for (i in 1 : B){
 x <- rnorm(n, mu, sigma)
 CI[i, ] <- mean(x) + c(-q0975, q0975) * sd(x)/sqrt(n)
 l[i] <- (mu > CI[i,1] & mu < CI[i,2]) 
}

Confidence interval for \(\mu\) when \(\sigma^2\) is unknown

# Empirical coverage probability
mean(l)

# Plot of the first 100 CIs
plot(1, xlim = c(0, 10), ylim = c(0, 11), type = "n", 
     xlab = expression(mu), ylab = "", yaxt = "n", 
     main = "100 CIs for the mean (unknown variance)", cex.main = 1.2)
abline(v = mu)

d <- 0
for (i in 1 : 100){
  d <- d + 0.1 
  lines(seq(CI[i, 1], CI[i, 2], length = 100), rep(d, 100), col = (l[i] + 1))
}
# number of intervals (out of the 100) including the true parameter value 
sum(l[1 : 100]) 

Comparing two population means or proportions

Comparing two population means or proportions

Two population means or proportions

We have two populations and we collect data on samples from these populations. Let's denote by \(y_{i1}\), \(i = 1, \ldots, n_1\), a certain observed characteristic in the first sample, and let \(y_{i2}\), \(i=1, \ldots, n_2\), denote the measurements of the same characteristic in the second sample. Here we follow Section 4.5 of Agresti and Kateri's book.

Confidence interval for comparing two means

Let’s consider a quantitative variable. Goal: compare the population means \(\mu_1\) and \(\mu_2\)

  • Parameter: \(\mu_1 - \mu_2\).
  • Parameter estimate: \(\bar y_1 - \bar y_2\).
  • Estimator: \(\bar Y_1 - \bar Y_2\).

Assume \(Y_{i1} \sim \mathcal{N}(\mu_1, \sigma^2_1)\), \(i=1, \ldots, n_1\), and \(Y_{i2} \sim \mathcal{N}(\mu_2, \sigma^2_2)\), \(i=1, \ldots, n_2\), with \(Y_{i1}\) and \(Y_{i2}\) independent. We will also assume \(\sigma^2_1 = \sigma^2_2 = \sigma^2\). Note that

  • If you cannot (or do not want to) assume normality, you can rely on the CLT for large samples.
  • If \(\sigma^2_1=\sigma^2_2=\sigma^2\) does not hold, there is the Welch test (we can also check this assumption by means of a hypothesis test).
  • With paired samples (\(n_1=n_2\)), \(Y_{i1}\) is not independent of \(Y_{i2}\): you can still work on the differences.
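In R the Welch version is the default of t.test, while var.equal = TRUE gives the pooled-variance interval. A sketch on simulated data (group sizes, means, and variances are arbitrary choices for illustration):

```r
set.seed(2)
y1 <- rnorm(30, mean = 0, sd = 1)  # group 1: smaller variance
y2 <- rnorm(20, mean = 0, sd = 3)  # group 2: larger variance
# Pooled-variance interval: assumes sigma_1^2 = sigma_2^2, df = n1 + n2 - 2
res_p <- t.test(y1, y2, var.equal = TRUE)
res_p$conf.int
# Welch interval (default): separate variances, approximated df
res_w <- t.test(y1, y2)
res_w$conf.int
```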

Comparing two population means or proportions

Two-sided confidence interval for the difference of two means

Let \(\overline Y_1\) and \(\overline Y_2\) be the sample means of the two groups, respectively. Let us identify the pivotal quantity.

We assume \(\sigma^2_1=\sigma^2_2= \sigma^2\), but \(\sigma^2\) is an unknown parameter: we need to estimate it. Let's define the pooled variance estimator

\[S^2= \frac{(n_1-1) S^2_1 + (n_2 - 1) S^2_2}{n_1 + n_2 - 2}\]

\[T = \frac{\overline Y_1 - \overline Y_2 - (\mu_1 - \mu_2) }{S\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \]

Now we know that \(T \sim t_{n_1+n_2-2}\), and the confidence interval of level \(1-\alpha\) for \(\mu_1-\mu_2\) is given by \[ \text{CI}^{(1-\alpha)}_{\mu_1-\mu_2}= \bigg(\bar y_1 - \bar y_2 - t_{n_1+n_2-2; 1-\alpha/2} \, s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}},\ \bar y_1 - \bar y_2 + t_{n_1+n_2-2; 1-\alpha/2} \, s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\bigg)\]
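The formula above can be wrapped into a small helper function (the name ci_diff_means is ours, for illustration only); it should match t.test with var.equal = TRUE:

```r
# Pooled-variance CI for mu_1 - mu_2, of level conf.level
ci_diff_means <- function(y1, y2, conf.level = 0.95) {
  n1 <- length(y1); n2 <- length(y2)
  s2 <- ((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2)
  tq <- qt(1 - (1 - conf.level)/2, df = n1 + n2 - 2)
  mean(y1) - mean(y2) + c(-1, 1) * tq * sqrt(s2 * (1/n1 + 1/n2))
}
```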

Example: Confidence interval for comparing two means

Example

Let us consider this example. A study makes use of a cognitive behavioral therapy to treat a sample of teenage girls who suffered from anorexia. The study, like most such studies, also had a control group that received no treatment. Then researchers analyzed how the mean weight change compared for the treatment and control groups. The girls in the study were randomly assigned to the cognitive behavioral therapy (Group1) or to the control group (Group2).

Example: Confidence interval for comparing two means

Example

Let \(\mu_1\) and \(\mu_2\) denote the mean weight gains (in pounds) for these groups.

Anor <- read.table("http://stat4ds.rwth-aachen.de/data/Anorexia.dat", 
                   header=TRUE)
# Get difference post-pre treatment for the group cb and c 
cogbehav <- Anor$after[Anor$therapy == "cb"] - Anor$before[Anor$therapy == "cb"]
control <- Anor$after[Anor$therapy == "c"] - Anor$before[Anor$therapy == "c"]


# Get the 95% CI via t.test function   
res <- t.test(cogbehav, control, var.equal = TRUE, conf.level = 0.95)
res$conf.int
[1] -0.680137  7.593930
attr(,"conf.level")
[1] 0.95
  • The mean weight change for the cognitive behavioral therapy could be as much as 0.68 pounds lower or as much as 7.59 pounds higher than the mean weight change for the control group.
  • The interval includes 0: it is plausible that the population means are identical (this is consistent with the corresponding hypothesis test).
  • The confidence interval is relatively wide: sample sizes are not large.

Time to work with R: It’s your turn

Obtain the confidence interval by hand.

Example


n1 <- length(cogbehav)
n2 <- length(control)

s2 <- ((n1 - 1) * var(cogbehav) + (n2 - 1) * var(control))/(n1 + n2 - 2)
  
CI <- mean(cogbehav) - mean(control) + c(-1,1) * qt(0.975, df = n1 + n2 - 2) *  sqrt(s2 * (1/n1 + 1/n2))
CI 
[1] -0.680137  7.593930

Example: Confidence interval for comparing two proportions

Example

For two groups, let \(\pi_1\) and \(\pi_2\) denote the population proportions for a binary variable of interest. Suppose we have independent samples \(Y_{i1}\), \(i=1, \ldots, n_1\), and \(Y_{i2}\), \(i=1, \ldots, n_2\), from \(Be(\pi_1)\) and \(Be(\pi_2)\), respectively. To obtain a confidence interval of level \(1-\alpha\) for \(\pi_1 - \pi_2\), we leverage the pivotal quantity

\[ Z = \frac{\Pi_1 - \Pi_2 - (\pi_1 - \pi_2)}{\sqrt{\hat \pi_1 (1 - \hat \pi_1)/n_1 + \hat \pi_2 (1 - \hat \pi_2)/n_2}}\] which is approximately distributed as \(\mathcal{N}(0,1)\). Above, we denoted by \(\Pi_1\) and \(\Pi_2\) the sample proportion estimators. The realised confidence interval is then given by

\[\text{CI}^{(1-\alpha)}_{\pi_1 - \pi_2} = \big(\hat \pi_1 - \hat \pi_2 \pm z_{1-\alpha/2} \sqrt{\hat \pi_1 (1 - \hat \pi_1)/n_1 + \hat \pi_2 (1 - \hat \pi_2)/n_2}\big) \]

Example: Confidence interval for comparing two proportions

Example

A study used patients at six U.S. hospitals who were to receive coronary artery bypass graft surgery. The patients were randomly assigned to two groups. For one group, Christian volunteers were instructed to pray for a successful surgery with a quick, healthy recovery and no complications. The praying started the night before surgery and continued for two weeks. The other group did not have volunteers praying for them. The response was whether medical complications occurred within \(30\) days of the surgery.

                  Complications
  Prayer          Yes     No      Total
  Yes             315     289     604
  No              304     293     597

Example: Confidence interval for comparing two proportions

Example

Let \(\pi_1\) denote the probability of complications for those patients who had a prayer group. Let \(\pi_2\) denote the probability of complications for the patients not having a prayer group.

success <- c(315, 304)
total <-  c(604, 597)
res <- prop.test(success, total, conf.level = 0.95, correct = FALSE)
res$conf.int
[1] -0.04421536  0.06883625
attr(,"conf.level")
[1] 0.95
p1 <- success[1]/total[1]
p2 <- success[2]/total[2]

p1 - p2 + c(-1, 1) * qnorm(0.975) * sqrt(p1 * (1 - p1)/total[1] + 
                                           p2 * (1 - p2)/total[2])
[1] -0.04421536  0.06883625